Compact Tensor Pooling for Visual Question Answering
نویسندگان
چکیده
Performing high level cognitive tasks requires the integration of feature maps with drastically different structure. In Visual Question Answering (VQA) image descriptors have spatial structures, while lexical inputs inherently follow a temporal sequence. The recently proposed Multimodal Compact Bilinear pooling (MCB) forms the outer products, via count-sketch approximation, of the visual and textual representation at each spatial location. While this procedure preserves spatial information locally, outerproducts are taken independently for each fiber of the activation tensor, and therefore do not include spatial context. In this work, we introduce multi-dimensional sketch (MDsketch), a novel extension of count-sketch to tensors. Using this new formulation, we propose Multimodal Compact Tensor Pooling (MCT) to fully exploit the global spatial context during bilinear pooling operations. Contrarily to MCB, our approach preserves spatial context by directly convolving the MD-sketch from the visual tensor features with the text vector feature using higher order FFT. Furthermore we apply MCT incrementally at each step of the question embedding and accumulate the multi-modal vectors with a second LSTM layer before the final answer is chosen.
منابع مشابه
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise multiplication or addition, as well as concatenation of the visual a...
متن کاملBilinear Pooling and Co-Attention Inspired Models for Visual Question Answering
In recent years, open-ended visual question answering has been an area of active research. In this work, we present our exploration of two state-of-art architectures including the Multi-modal Compact Bi-linear Pooling (MCB) and Dynamic Memory Network (DMN) and analysis of the result and performance of the models. We found both models to perform comparably on the VQA v2.0 dataset based on predic...
متن کاملConvolutional Neural Tensor Network Architecture for Community-Based Question Answering
Retrieving similar questions is very important in community-based question answering. A major challenge is the lexical gap in sentence matching. In this paper, we propose a convolutional neural tensor network architecture to encode the sentences in semantic space and model their interactions with a tensor layer. Our model integrates sentence modeling and semantic matching into a single model, w...
متن کاملDon't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
A number of studies have found that today’s Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we...
متن کاملMultimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation
In state-of-the-art Neural Machine Translation, an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, whe...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1706.06706 شماره
صفحات -
تاریخ انتشار 2017